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Using data from the Microarray Quality 
Control (MAQC) project, we demonstrate 
two data-analysis methods that shed light 
on the normalization of gene expression 
measurements and thereby on their 
technical variation. One is an improved 
method for normalization of multiple 
assays with mRNA concentrations related 
by a parametric model. The other is a 
method for characterizing limitations on 
the effectiveness of normalization in 
reducing technical variation. We apply 
our improved normalization to the four 
project materials as part of testing the 
linearity of the probe responses. We find 
that the lack of linearity is statistically 
significant but small enough that its 
sources cannot be easily identified. 
Applying our characterization method to 



assays of the same material, we show that 
there is a source of variation that cannot be 
eliminated by normalization and therefore 
must be dealt with by other means. Four 
high-density, single probe, one-color 
microarray platforms underlie our 
demonstration. 
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1. Introduction 

Monitoring the expression of thousands of genes 
simultaneously for the purpose of obtaining biological 
insight requires an effective approach to the technical 
variation. Particulars of technical variation are often 
sought through an inter-laboratory study, of which the 
Microarray Quality Control (MAQC) project is an 
example. Using data from the MAQC project, we 
demonstrate statistical modeling techniques that lead to 
better understanding of the technical variation in gene 
expression assays. These techniques have had little pre- 
vious application in microarray data analysis. In Sec. 4, 
we discuss the relation of conclusions based on these 
techniques to other MAQC results. 

In the MAQC project, gene expression levels were 
measured from two high-quality, distinct RNA refer- 
ence materials and two mixtures of these materials. 
Data from four microarray platforms of the seven in the 



MAQC project provide the basis for the demonstrations 
in this paper. These platforms are 1 



Manufacturer 


Code 


Platform 


Applied Biosystems 


ABI 


Human Genome Survey 
Microarray v2.0 


Illumina 


ILM 


Human-6 BeadChip 48K vl.O 


Agilent 


AG1 


Whole Human Genome Oligo 
Microarray, G4112A 


GE Healthcare 


GEH 


CodeLink Human Whole 
Genome, 300026 









Certain commercial equipment, instruments, or materials are iden- 
tified in this paper to foster understanding. Such identification does 
not imply recommendation or endorsement by the National Institute 
of Standards and Technology, nor does it imply that the materials or 
equipment identified are necessarily the best available for the 
purpose. 
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With some extensions, our techniques could be applied 
to the data from the other three microarray platforms. 
Each microarray platform was deployed at three test 
sites, and five replicates were assayed at each site. 
Thus, for each platform, the MAQC project produced 
sixty assays. The complicating feature in this otherwise 
simple inter-laboratory study is the large number of 
expression levels measured in each assay. This feature 
leads us to the use of novel statistical methods for 
investigating technical variation. 

In its original form, a gene expression assay consists 
of many probe responses, which are generally intensity 
readings from a scanner. In the cases of the platforms 
considered here, the number of probe responses is 
greater than 30,000, although we only select some of 
these responses for our demonstrations. Each probe is a 
distinct piece of nucleic acid that is immobilized on the 
microarray substrate. In the case of gene expression, 
each probe interrogates an mRNA sample to determine 
the relative abundance of a particular transcript among 
all the transcripts that make up the sample. Hybridi- 
zation is the process of applying an mRNA sample to 
the probes that make up a microarray. 

Sources of technical variation, the focus of this 
paper, affect the probe responses. Some sources of 
technical variation affect most probe responses shifting 
all of them up or all of them down. The effects of such 
sources can be corrected in part through what is called 
normalization. Two basic choices of normalization 
method are usually considered. One is assay-to-assay 
equalization of the intensity distribution [1]. The other 
is use of external RNA controls [2, 3]. The part of the 
technical variation that is most important is the part that 
normalization cannot correct. 

A third normalization choice can be applied to probe 
responses from the MAQC project. Each laboratory 
measured four reference materials, two of which are 
mixtures of the other two. Because of this mixing, the 
expression levels in the materials are all related by a 
parametric model: The expression levels in the mixture 
materials are known proportions of the levels in the 
original materials. As this paper demonstrates, this para- 
metric model can be the basis for a different type of 
normalization. Although this normalization is potential- 
ly useful in reaching biological conclusions, our imme- 
diate purpose for introducing it is testing the linearity of 
the probe responses. 

Another aspect of normalization is limitations on its 
effectiveness. The basis of normalization, regardless of 
the type, is assumption of a relation among the effects 
of sources of technical variation on the intensity meas- 



urements from an array. Equalization of intensity distri- 
butions depends, at a minimum, on the assumption that 
the technical variation shifts all the intensities of an 
assay in the same direction. Use of external controls 
depends on an assumption that allows changes in the 
control intensities to be extrapolated to the other inten- 
sities. In this paper, we use the MAQC measurements 
to show limitations on normalization due to more 
complicated intensity co-variation. 

This paper provides results on the technical variation 
observed through measurements of different materials 
made in the same laboratory and through measurements 
of the same material made in three different laborato- 
ries. The two sets of results are based on different sta- 
tistical models, which are discussed in the next section. 
The results themselves appear in Sec. 3 and are dis- 
cussed in Sec. 4. Finally, details on the methods are 
given in Sec. 5. 

2. Statistical Models 

Implementation of our normalization method, which 
we call parametric normalization, requires a mathemat- 
ical model that faithfully describes the relations among 
assays. For a particular laboratory, consider the intensi- 
ty value (probe response) for assay / and probe g, which 
we denote by y ig , where i = 1, ..., 20. The 20 assays are 
5 hybridizations of material A, 5 of material B, 5 of 
material C, and 5 of material D, respectively. Para- 
metric normalization as implemented here consists of a 
linear transformation of all the intensities of an assay. 
Thus, after normalization, the intensity for assay i and 
probe g is given by 

y* - lot 



n, 

where % and 77, are the normalization coefficients for 
assay i. Our parametric normalization model is 



GV -*7o,)/»7, 



x Ai o Ag + x Bi e Bg +c ig % 



(i) 



where for materials A, B, C and D, x Ai equals 1,0, y c or 
(1 - Yd), respectively, and x m equals 0, 1, (1 - y c ) or y D , 
respectively. This equation represents intensities in 
terms of mean intensities, 9 Ag and 9 Bg , and noise. For 
material A, the right side is 9 Ag + o ig e ig , and for material 
B, Q Bg + c ig e ig . The quantity e fe is a random variable with 
mean zero and variance one. The standard deviation 
of the noise c ig is shown as a function of assay i and 
probe g. In this model, the fraction of material C that 
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came from material A is y c , and the fraction of material 
D that came from material B is y D . On the basis of this 
equation, parametric normalization is performed by 
fitting the unknown parameters %, r\ h Ag , and Bg to 
the observed intensities y ig and then using the estimated 
values of % and rj, for normalization. The algorithm is 
given in Sec. 5. 

Investigation of the limitations on normalization also 
requires a mathematical model but of a different type. 
Consider measurements of the same material given as 
intensities on the base 2 logarithmic scale. We denote 
these intensities by z ig , where i= 1, ..., 15 indexes the 
five assays from each of three laboratories. Computa- 
tion of the correlation matrix is a familiar summariza- 
tion of such data. Let the covariance between assays i 
and j be 

rE( z *-^X z *-^)> (2) 



G-\ 



where z, is the mean of the log intensities over the 
probes for assay i and G is the number of probes. The 
familiar correlation between assays / and j is given by 



s ij i ^j s ii s ji ■ 



Our approach is based on modeling the covariance 
between assays i and j, which is estimated by s ip using 
a factor analysis model with two common factors. The 
covariance model is 



Xl + x 2 , 



+ Wi 



^Aji + h a x j2 



if / = j 
if i*j. 



This covariance modeling is equivalent to modeling 
the z ig by means of the factor analysis model 



z ig =H l +* l ifi g +^24 + V/% , 



(3) 



where the f lg ,f 2g , and e ig , i = 1, ..., 15 are all indepen- 
dent, normal random variables with mean and vari- 
ance 1. In this model, the random variables/^ andf 2g , 
represent the probe-to-probe variation in the common 
factors and the parameters X n and X a govern the contri- 
butions of the common factors to the individual assays. 
This modeling gives the (log) intensity distribution 
for assay i as normal with mean //, and variance 
X\ + X] 2 +y, ■ The limitation arises because equaliza- 
tion of the intensity distribution has only this mean and 
variance available for adjustment and not the X n and X a 
individually. 

The results in this paper are based on modeling the 
technical variation in four platforms. From platform to 



platform, the technical variation as portrayed by these 
two models is similar. We leave aside gene-level mod- 
eling of the differences among platforms [4]. We do not 
consider the question of inter-platform reproducibility, 
the main subject of many studies [5, 6]. 

3. Results 

3.1 Parametric Normalization 

To assess the linearity of the probe responses after 
parametric normalization, we determine how well the 
mixture model fits the normalized measurements. Our 
measure of lack of fit is a weighted sum of squared 
deviations, which we compute for each probe. For a 
particular laboratory, let the normalized intensities 
for probe g be denoted by u ig , i=l,...,20. From 
these intensities, we estimate Ag and Bg denoting the 
estimates by Ag and Bg . The averages of the five 
normalized intensities for each material deviate from 
the values computed with the mixture model in 
accordance with 

U A g - Qjg 
u Bg - Bg 

rJ Ag -^-r c )L' 



*Cg 



^-r D )o Ag -r D 0, 



Note that these deviations reflect departure of the nor- 
malized probe responses from linearity. Lack-of-fit for 
each probe is indicated by the weighted sum of squares 
of these deviations 

w a s &A g - d Ag f + w Bg (u Bg -e Bg f 
+w c g (" Cg - rJ Ag - (i - Yc )4 g ) 2 
+w Dg (u Dg -(i-y D )e Ag -y D e Bg Y , 



where the weights w 



Ag; 



v Cg 



, and w Dg are inversely 



proportional to O ig . 

The amounts of RNA material mixed to form 
materials C and D were in 3 to 1 and 1 to 3 ratios. Were 
the fraction of mRNA in materials A and B the same, 
we would have y c = y D = 0.75. However, these fractions 
are not exactly equal. Based on our analysis of the 
MAQC data and a complementary analysis [7], we 
have adopted y c = 0.80 and y D = 0.69 as values for our 
deviations. In our analysis, we obtained estimates of 
y c and y D from each of the four platforms we consider. 
The estimates given by different platforms are in reason- 
able agreement. 
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One way to interpret our deviations is to compare 
them with what would be expected were the mixture 
model to fit perfectly [8]. We obtain a standard devia- 
tion from each of the five normalized intensities for a 
material and denote these by s Ag , s Bg , s Cg , and s Dg . The 
comparison involves the ratio of the sum of squared 
deviations with 



J_ 
10 



[V* + V* +,( !»4H 



£]■ 



The scale factor 1/10 depends on specifics of the MAQC 
design and the number of unknown parameters in the 
mixture model. If the mixture model fits, this ratio 
is F distributed, that is, distributed as the ratio 
of two independent variances. The numerator and 
denominator degrees of freedom are 2 and 16. 

To complete this interpretation of lack-of-fit, we com- 
bine the F values from all the probes into a distribution 
and compare this observed distribution with the F distri- 
bution for 2 and 1 6 degrees of freedom. We make this 
comparison with a quantile-quantile plot. Each point on 
this plot corresponds to a probe. The y value is the natu- 
ral logarithm of the F value for this probe. Thus, a y 
value corresponds to an observed quantile. The corre- 
sponding x value is obtained from the ranking of this F 
value among all the F values. The x value is obtained as 
follows: First, find the proportion of all the F values that 
are less than the particular F value. Let this proportion be 
m/n, where n is the total number of probes. Second, find 
the value such that the cumulative F distribution for this 
value is (m + 0.5)/«. The logarithm of this value is the x 
value for the probe. Thus, an x value corresponds to a 
quantile hypothesized on the basis of perfect fit to the F 
distribution. Quantile-quantile plots for the sites that 
made use of the Applied Biosystems platform (ABI) are 
shown in Fig. 1 . The methods section describes selection 
of probes for this figure. In addition to the points, there 
is the x = y line. If there were no lack of fit, the points 
would lie on this line. Apparently, there is some lack of 
fit, and the unanswered questions involve the implica- 
tions of this lack of fit. 

The size of the gap between the points and the x = y 
line depends on both the fit of the mixture model and on 
the denominator of the F ratio, which is proportional to 
the variance exhibited by the replicate measurements on 
the same material. To judge the gap in terms of the repli- 
cate variance, we can ask for the size of the factor that 
the denominator of the F ratio would have to increase for 
the points and the line to coincide. This factor is general- 
ly small as is shown by the fact that if the denominator 
were increased by a factor of 2.7, then the points would 



move downward 1 unit on the y axis. This suggests that 
the lack of fit exhibited in Fig. 1 is so small in compari- 
son to the noise level that diagnosis of this lack of fit 
might be difficult. Moreover, note that the gap is 
ambiguous as a performance measure since it increases 
with lack of fit but decreases with increasing noise level. 

For each of the sites in Fig. 1, the departure of the 
points from the x = y line is greater at the upper end. This 
shows that the upper tail of the observed distribu- 
tion is heavier than it would be were the F values truly F 
distributed. Generally, one might summarize the lack of 
fit as being most notable in a small fraction of the probes 
that have observed F values so large that they cannot be 
attributed to estimation error as given by the F distribu- 
tion. 

Another view of the sum of squared deviations is com- 
parison of its value under parametric normalization with 
its value under the manufacturer's specified normaliza- 
tion. To obtain this latter value, we use the manufactur- 
er's values for u ig but do not change the weighting in 
either the estimation of 9 Ag , and 9 Bg and or in the compu- 
tation of the weighted sum of squared deviations. 
Because the two normalizations may differ by a scale 
factor, we divide each sum of squares by the correspon- 
ding value of (9 Ag - 9 Bg ) 2 . We have compared one 
scaled sum of squares with the other for each of the sites 
that used the Illumina platform (ILM). We see that for 
the majority of the probes considered, parametric nor- 
malization gives a smaller scaled sum of squares, which 
indicates that parametric normalization provides a better 
fit. This, of course, is not surprising because the paramet- 
ric normalization is optimized on the basis of the mixture 
model whereas the manufacturer's normalization does 
not make use of such knowledge. 

In characterizing the non-linearity of probe responses, 
one must apply normalization but, ideally, normalization 
that does not introduce non-linearity by itself. Figure 2 
provides lack-of-fit quantile-quantile plots for the data 
under parametric normalization and under manufactur- 
er's prescribed normalization. That the parametric nor- 
malization gives F statistics closer to the x = y line shows 
that parametric normalization is better for characteriza- 
tion of the inherent non-linearity of the probe responses. 

Linearity of the probe responses is especially impor- 
tant in the determination of differential expression 
between two samples. We consider the differential expres- 
sion between materials A and B expressed as the base 2 
logarithm of the ratio of the two responses. We compute 
the differential expression as log 2 (9 Ag /9 Bg ). Figure 3 
compares differential expression determined under para- 
metric normalization with differential expression deter- 
mined under manufacturer's prescribed normalization. 
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Fig. 1. For ABI, quantile-quantile plots that compare the observed F statistics with the F distribution with 2 and 
1 6 degrees of freedom. Closeness to the x = y line indicates weakness of evidence of lack of fit 
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Fig. 2. For ILM data under parametric normalization (black open circle) and under manufacturer's prescribed 
normalization (gray solid circle), superimposed quantile-quantile plots that compare the two sets of observed F 
statistics with the F distribution with 2 and 1 6 degrees of freedom. Closeness to the x = y line indicates better 
linearity for parametric normalization. 
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Fig. 3. For ILM data, bias in differential expression under manufacturer's prescribed normalization when 
compared to parametric normalization. The differences in differential expression values are plotted versus the 
parametric normalization values. 



In Fig. 3, we treat parametric normalization as the base- 
line because it provides better linearity as shown in 
Fig. 2. The vertical dimension shows the bias intro- 
duced by the manufacturer's prescribed normalization. 
Note that the bias is toward higher differential expres- 
sion for material A. This figure shows that were it 
possible, there might be benefit in using parametric 
normalization in the interpretation of the results from a 
biological experiment. 

3.2 Effectiveness of Normalization 

We now switch from within-laboratory modeling of 
different materials to among-laboratory modeling of 
the same material. It has long been recognized that 
there are potentially substantial inter-laboratory differ- 
ences in outright measurement of intensities. For this 
reason, one would avoid comparison of measured 
intensities across laboratories if one could. However, 
biological considerations in the design of a study may 
require processing samples to be compared in different 
laboratories or in a single laboratory with sufficient 
time delay that the technical variation will have inter- 
laboratory characteristics. This sub-section shows the 
perils in such comparisons. 



Of course, the meaningfulness of the following char- 
acterization of inter-laboratory differences depends on 
data used for the characterization. We note the follow- 
ing: Mixing should not be apparent as a source of inter- 
laboratory variation because the materials were mixed 
in one laboratory and shipped to all the other sites in the 
project. Each platform manufacturer was motivated to 
choose the other two sites for their platform carefully 
and to harmonize protocols among the three sites. 
There was a pilot study that contributed to the harmo- 
nization. However, it is true that gauging how well the 
protocols agreed is difficult because of the multi- 
dimensional nature of microarray assays. Finally, gen- 
eralization from three sites per platform is subject to 
appreciable uncertainty. Thus, from the MAQC data, 
one cannot say how difficult it might be to overcome 
the inter-laboratory differences discussed. 

As discussed in the methods section, we select, for 
the Agilent platform (AG1), the 3000 probes for each 
material that have the highest intensity overall. 
Applying factor analysis to the set of material A inten- 
sities transformed to the log base 2 scale, we obtain 
Table 1. The symbols heading the columns in Table 1 
are those found in Eq. (3). The column labeled Center 
gives the mean of the 3000 log intensities for each 
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Table 1. For AG1 results on material A, estimated parameters of the factor analysis model given in Eq. 3. For each 
assay, the parameters give the distribution of the log (base 2) intensities. 





Site 1 


Center 


Specific Factor 


Common Factor 


Common Factor 




Hi 


w- 12 


K 


A, 2 


Site 1 


9.82 


0.26 


0.66 


0.72 


Site 1 


9.94 


0.25 


0.75 


0.64 


Site 1 


9.83 


0.25 


0.71 


0.68 


Site 1 


9.98 


0.23 


0.74 


0.65 


Site 1 


9.78 


0.25 


0.73 


0.65 


Site 2 


10.77 


0.07 


0.81 


0.59 


Site 2 


10.42 


0.07 


0.82 


0.57 


Site 2 


10.40 


0.09 


0.83 


0.55 


Site 2 


10.42 


0.16 


0.82 


0.56 


Site 2 


10.32 


0.09 


0.85 


0.52 


Site 3 


10.58 


0.10 


0.56 


0.82 


Site 3 


10.60 


0.11 


0.56 


0.83 


Site 3 


10.30 


0.04 


0.56 


0.83 


Site 3 


10.50 


0.08 


0.55 


0.83 


Site 3 


10.26 


0.07 


0.58 


0.81 



assay. Inclusion of this parameter in the model provides 
an elementary form of assay-to-assay normalization. The 
column labeled Specific Factor gives an estimate of the 
standard deviation of the noise that is independent from 
probe to probe. This column also shows some difference 
among laboratories. 

The columns labeled Common Factor complete the 
model of the distribution of the log intensities that 
our analysis provides. The intensities are centered at fi t 
and have spread given by the standard deviation 



I 



K+K+Yi 



Between assays / and j, the log intensi- 
ties have correlation given by 

X n X fl + X i2 X j2 



^(X] l+ X] 2+Wl )(X) l+ X) 2+Wj ) 



The amount of correlation depends on the similarity of 
parameters X n and X a , which model how the common 
sources affect each assay. 

The relation between the components of the two com- 
mon factors X n and X a is shown in Fig. 4 for all four 
materials. Normalization is generally based on differ- 
ences among distributions of the intensities of the assays. 
In terms of our model, this distribution depends on the 
distance of the points in this figure from the origin. Thus, 
normalization can be used to adjust the intensities for 
each assay so that the distance of each assay from the ori- 
gin is the same. However, Fig. 4 shows that such adjust- 
ment cannot eliminate the differences among assays. The 
oblique axes shown in each panel generally differentiate 
the part of the assay-to-assay differences that can and 



cannot be reduced by normalization. Apparently, there is 
a source of technical variation that must be dealt with not 
by normalization, but by other means. Such a source 
might be variation in some experimental circumstance, 
perhaps temperature or washing technique, that interacts 
with the strength of the hybridization bond. 

In addition to showing the effects of sources of varia- 
tion, Fig. 4 also shows clustering of the assays by labo- 
ratory, which portrays the magnitude of the among-labo- 
ratory variation compared to the within-laboratory varia- 
tion. Comparisons like this are central to metrology. In 
the case of univariate measurements, the comparison is 
between the within-laboratory variance and the among- 
laboratory variance and is called gauge R&R (repeatabil 
ity and reproducibility) [9]. In the case of high dimen- 
sional measurements such as gene expression assays 
with thousands of probes, it is common to apply some 
form of dimension reduction such as principal compo- 
nents analysis (PCA) followed by display of the 
measurements in the reduced coordinates [10, 11]. 
Irizarry et al. [12] discuss inter-laboratory variation in 
gene expression measurements. 

Instead of PCA plots, we obtain a perspective on the 
measurements that places more emphasis on the among- 
laboratory variation. Our approach is linear discriminant 
analysis. Like PCA, we find two linear combinations of 
the measurements from each assay. In the case of dis- 
criminant analysis, these combinations are called dis- 
criminant coordinates. The difference is that we find lin- 
ear combinations that maximize the separation of the 
laboratories with respect to the within-laboratory varia- 
tion. To do this, we must adopt penalized discriminant 
analysis [13, 14]. 
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Fig. 4. For AG1, plots of common factors showing assay differences with assay sites labeled. Shown are the relations of the differences to the 
origin. Distribution-based normalization might bring all the assays to the same distance from the origin, but it cannot bring the assays together in 
the other dimension. 



Based on the log 2 intensities for material A from the 
GE Healthcare platform (GEH), we derive discriminant 
coordinates with which we plot the 1 5 material-A assays. 
These points along with points for material B are shown 
in Fig. 5. We see that the groups of material- A points for 
each laboratory are separated. This is not surprising since 
we derived the discriminant coordinates from the mate- 
rial A measurements. 

What is more interesting about Fig. 5 is the clustering 
of the material-B points. The discriminant coordinates 
for material A also separate the material B assays by lab- 
oratory. Apparently, the measurements from each array 
have inherent in them a signature associated with the 
processing laboratory. Such a signature originates 
from one or more sources of technical variation. This 
signature is strong enough that it stands out despite 
the difference between materials A and B. Were we to 
add materials C and D to Fig. 5 in the same way 



as material B, we would obtain the same clustering for 
these materials. Reversing the roles of materials A and B, 
that is, deriving the discriminant coordinates from the 
material B measurements instead, gives a result similar 
to Fig. 5. 

Portraying among-laboratory differences in terms of 
discriminant analysis is particularly relevant to develop- 
ment of biomarkers on the basis of microarray measure- 
ments. Biomarkers can be developed from microarray 
measurements of samples from both cases and controls. 
Consider the possibility that the cases have been meas- 
ured in a different laboratory than the controls, perhaps 
for logistical reasons. What Fig. 5 shows is that among- 
laboratory differences could easily be mistaken for case- 
control differences. Moreover, this suggests that within- 
laboratory differences involving a considerable time 
period might also be mistaken for case-control 
differences. 
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Fig. 5. For GEH, assays labeled by material and site plotted with material A discriminate coordinates. 
Coordinates give the maximum inter-laboratory separation for results on material A. Plot shows that material B 
results have laboratory signatures like those of material A. 



4. Discussion 

As a resource for investigation of technical variation, 
the laboratories involved in the MAQC project have 
provided measurements with state-of-the-art replica- 
tion. Such measurements are of particular interest 
because the variation observed represents a technolog- 
ical challenge, not a circumstance that can be easily 
changed by closer attention to the requirements of the 
measurement protocol. 

An overview of the MAQC project is provided in 
Ref 15. This reference provides a more complete 
description of the MAQC data including parts of data 
set not considered in this paper. This reference also 
contains figures that depict the MAQC data. The title of 
this reference is problematical in that it implies that 
reproducibility is a qualitative property that a measure- 
ment system does or does not have. In metrological 
studies, reproducibility is a quantitative property that 
describes closeness of the agreement between the 
results of measurements on the same material. In the 
case of univariate measurements, reproducibility is 
usually described by a standard deviation. It is not clear 
what quantitative summary of the technical variation in 
microarray assays is sufficiently inclusive of all aspects 
of this technical variation. As demonstrated herein, 



statistical modeling can portray some aspects of repro- 
ducibility for gene expression assays. 

On the basis of statistical modeling, this paper char- 
acterizes the technical variation in four ways generally 
corresponding to the figures. First is the linearity of 
probe responses, which we characterize by comparison 
with appearances that can be expected from noise 
alone. Second is the amount of reduction of technical 
variation by off-the-shelf normalization, which we 
characterize by comparison with parametric normaliza- 
tion. Third is a fundamental limitation of normaliza- 
tion, which we characterize in terms of sources of vari- 
ation not subject to the usual normalization assump- 
tions. Fourth is a difference between laboratories, 
which we characterize by choosing a certain perspec- 
tive on this difference and comparing it with the differ- 
ence between materials A and B. Note that the figures 
except for Figs. 2 and 3 are each based on a different 
platform. Each characterization could be applied to 
each platform giving a total of twenty figures, which 
we do not display because each characterization gives 
similar results for each platform. 

The predominant features of these characterizations 
and the statistical models that they reflect can be 
expected to hold true in future applications of the plat- 
forms to biological studies. Prediction of properties of 
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the technical variation that one should expect in future 
studies is the purpose of inter-laboratory studies such as 
the MAQC project. This standard should be applied to 
the depiction of the MAQC data in Ref. 15. 

Common in the application of microarrays is screen- 
ing for a few differentially expressed genes. This para- 
digm influences the way people analyze measurements 
when trying to understand technical variation. In partic- 
ular, many of the graphical methods familiar in microar- 
ray analyses are centered on quantities with familiar 
meanings in the context of individual genes and have 
probe-to-probe independence as the foundation for their 
interpretation. In contrast, this paper focuses on assay-to- 
assay associations that involve many probes. Such asso- 
ciations generally point to a few sources of technical 
variation and, in addition, form the basis for normaliza- 
tion. The characteristics of technical variation we have 
identified are important in common applications of 
microarray technology, but the connection requires some 
thought. Moreover, this thought is different from the 
quality-control thinking that serves so well for univariate 
measurements. Univariate quality control has only limit- 
ed application to a measurement method that gives 
perhaps 50,000 responses at a time. 

Many scientists require a transparent link between the 
original data and analysis results before they are willing 
to trust a characterization. Since providing this link with 
a single figure does not seem possible, we have per- 
formed further analyses to verify our characterizations. 
Regarding Fig. 1, for example, the observation that 
points for site 2 lie below the x = y line seems to indicate 
violation of some assumption about the data besides 
model lack of fit. We have identified violation of the 
normality assumption as a possible cause. Regarding 
Fig. 4, we chose to do factor analysis with two common 
factors. Inclusion of a third common factor adds detail to 
the results. More broadly, we note that two different 
models of microarray intensity form the basis of this 
paper and that this raises the question of which model is 
more faithful to the physical processes underlying 
microarray measurements. Although better models give 
better results, models do not have to be exact to give use- 
ful results. Because we do not use the two models in the 
same context, they can each be useful without their being 
in agreement. 

The analyses in this paper indicate that technical vari- 
ation in microarray assays might sometimes be reduced 
through better approaches to normalization, but that 
reduction of the technical variation itself through 
improved measurement protocols and through more con- 
sistent adherence to these protocols might finally be 



needed. Moreover, the analyses provided should be con- 
sidered in developing experimental designs for micro- 
array studies and QC metrics for monitoring study 
progress. In particular, our analyses show technical vari- 
ation problems in comparing expression levels across 
sites and therefore potentially also over long time periods. 

5. Methods 

The original measurements, the responses of the 
probes before normalization, are the starting point for 
our approach. This is necessary because our approach 
involves insight into technical variation gained through 
study of normalization. We apply a novel normalization 
method, parametric normalization, as a way of observing 
non-linearity in the responses of the probes. In portray- 
ing the limitations on normalization, we compare 
among-laboratory variability with within-laboratory 
variability. This is a comparison we cannot make with 
measurements subjected to a type of normalization that 
reduces within-laboratory variation without similar 
reduction in the among-laboratory variation. 

5.1 Probe Selection 

We selected probes for each of our four analyses fol- 
lowing a similar process for each platform. We began 
with the 12091 probes in the commonly mapped set 
chosen by the MAQC organizers for inter-platform com- 
parison. For AG1 and GEH, we eliminated probes from 
this set based on the qualitative calls or on duplication of 
an identifier in the original data. For AG1, inclusion is 
based on the value "0" for control type, feature non- 
uniformity, and saturation. For GEH, inclusion is based 
on "L" or "G" as the qualitative call. Moreover, we 
eliminated any probe that does not meet the inclusion 
criterion for all sixty assays performed with the platform. 
We entirely eliminated probes with duplicate identi- 
fiers. Thus, we begin our selection with 12091, 12091, 
11705, and 9378 probes for ABI, ILM, AG1, and GEH, 
respectively. 

For each platform and each material, we next selected 
the 3000 probes with highest expression level. Consider 
material A. First, we transformed the intensities to the 
base 2 logarithmic scale. This required addition of a 
starter of 100 for AG1 and a starter of 0.3 for GEH. Next, 
we centered the logarithmic intensities for each assay at 
their mean. We then computed the mean over the 15 
assays for material A and on the basis of this mean select- 
ed the 3000 largest intensities. This gives a set of probes 
we call the A set. Similarly, we obtained the B set, the C 
set and the D set. 



370 



Volume 111, Number 5, September-October 2006 

Journal of Research of the National Institute of Standards and Technology 



Our lack-of-fit analysis and normalization comparison 
are based on combining the A set and the B set with an 
exclusive OR. In other words, the probes in these analy- 
ses are either in the A set but not the B set or in the B set 
but not in the A set. The number of probes is 1982 for 
ABI and 1984 for ILM. We applied factor analysis to 
single materials: the probes in set A, set B, set C and set 
D for materials A, B, C and D, respectively. We applied 
discriminant analysis to the material A responses from 
the probes in set A. We then calculated discriminant co- 
ordinates for the material B responses from the probes 
in set A. 

5.2 Parametric Normalization 

Our algorithm for parametric normalization is a com- 
bination of iterative reweighted least squares [16] and 
crisscross regression [17, 18]. The weighting is based on 
the suggestion of Rocke and Durbin [19]. 

For regression under the model discussed in Sec. 2, 
the weights are inversely proportional to the variance 
(Jig. Following Rocke and Durbin [19], this variance is 
proportional to the sum of the expression level squared 
( x Ai^A g + x Bi Bg ) 2 and a constant. We followed the 
approach of Rocke and Durbin [19] in finding the size of 
the constant relative to the size of the squared expression 
level. 

To obtain an initial estimate of the weights, we esti- 
mate the expression levels from the un-normalized inten- 
sities y ig using un-weighted least squares. Then, using 
these weights, we re-estimate the expression levels. 

The main iteration in our algorithm follows: First, we 
use the current estimate of the expression levels to esti- 
mate the normalization coefficients r/ 0; and r\ r These 
estimates are obtained by fitting the model 

to the un-normalized intensities with weighted least 
squares based in the current expression levels. Second, 
we re-normalize the intensities 



y* 



rioi 
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and use these to re-estimate the expression levels. This 
estimation is based on weights computed from the 
previous expression levels. We then begin the next 
iteration by re-estimating the normalization. 

5.3 Factor Analysis 

Separately for each material, we apply factor analy- 
sis to original AG1 intensities transformed to the log 



(base 2) scale. For the 3000 probes with highest expres- 
sion levels that are given by sets A, B, C, and D, respec- 
tively, it is not necessary to add a starter to make the 
responses all positive. Our factor analysis computations 
are based on the usual maximum likelihood algorithm 
[20]. 

5.4 Discriminant Analysis 

We apply discriminant analysis to original GEH inten- 
sities transformed to the log (base 2) scale with a starter 
of 0.3 added. We select the set A probes for both materi- 
als A and B. Because the starter is necessary for only one 
intensity value, we could as easily use another approach, 
but this would not lead to any noticeable change in 
Fig. 5. The classical approach to discriminant analysis 
does not apply in our situation because there are many 
more probes than replicate assays. For this reason, our 
derivation is based on penalized discriminant analysis 
[13], which entails adoption of a penalty function. As 
inputs to their algorithm, we take the centered log base 2 
intensities. This centering is an elementary form of nor- 
malization. Our choice of penalty function is the identi- 
ty matrix multiplied by 1000. Hastie and Tibshirani [14] 
provide a needed part of the algorithm. 
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